Summary statistics of cities data

The variable “burglary”

library(classdata)
A = cities$Burglary

mean(A, na.rm   = TRUE)
## [1] 112.6739
sd(A, na.rm = TRUE)
## [1] 273.271

Questions for cities data

Please see the questions in the folder “Questions for cities data” in this repository.

Making plots for fbiwide data

library(ggplot2)

ggplot(fbiwide, aes(x = log(burglary), y = log(robbery))) + geom_point()

ggplot(fbiwide, aes(x = log(burglary), y = log(motor_vehicle_theft))) + geom_point()

# Add color by years

ggplot(fbiwide, aes(x = log(burglary), y = log(motor_vehicle_theft), colour = year)) + geom_point()

Questions in the lecture “02_r-graphics”

  1. Draw a scatterplot of the log transformed number of burglaries by motor vehicle thefts. Map the state variable to colour. Why is this a terrible idea?
ggplot(data = fbiwide, aes(x = log(burglary), y = log(motor_vehicle_theft))) + geom_point(aes(color = state))

This is a bad idea because too many states and colors can not separate each of them.

  1. Draw a scatterplot of the log transformed number of burglaries by motor vehicle thefts. Map population to size. How do we interpret the output?
ggplot(data = fbiwide, aes(x = log(burglary), y = log(motor_vehicle_theft))) + geom_point(aes(color = state, size = population))

  1. Have a look at the RStudio cheat sheet on visualization. Draw a histogram of the state populations
ggplot(data = fbiwide, aes(x = population)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The above histogram does not make sense as it includes multiple years for the same state.

We make the histogram of popualtion for 2019.

ggplot(data = fbiwide[fbiwide$year == 2019, ], aes(x = population)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. Compare the log transformed number of burglaries by motor vehicle thefts over years. How to make a nice plot?
fbiwide1 = fbiwide
fbiwide1 = fbiwide1[fbiwide1$year >= 1980, ]
fbiwide1 = fbiwide1[fbiwide1$year < 2020, ]
fbiwide1$decade = floor(fbiwide1$year / 10)
fbiwide1$decade = paste0(fbiwide1$decade, "0s")

ggplot(data = fbiwide1, aes(x = log(burglary), y = log(motor_vehicle_theft))) + geom_point(aes(color = decade))

  1. Compare the log transformed number of burglaries by motor vehicle thefts over States, coloured by years.
ggplot(data = fbiwide1, aes(x = log(burglary), y = log(motor_vehicle_theft))) + geom_point(aes(color = decade)) + facet_wrap(.~ state_abbr)

  1. Now, only focus on comparing California, Colorado, Iowa, Illinois, District of Columbia and New York.
state.subset = c("California", "Colorado", "Iowa", "Illinois", "District of Columbia", "New York")

fbiwide2 = fbiwide1[fbiwide1$state %in% state.subset, ]
ggplot(data = fbiwide2, aes(x = log(burglary), y = log(motor_vehicle_theft))) + geom_point(aes(color = decade)) + facet_wrap(.~ state)

  1. We all know population is an important factor. How to compare different states by standardized population?
ggplot(data = fbiwide2, aes(x = log(burglary / population * 66424), y = log(motor_vehicle_theft / population * 66424))) + geom_point(aes(color = decade)) + facet_wrap(.~ state)

Questions in the lecture “03_r-graphics”

“Facet” option in ggplot

  1. Plot the number of car thefts by year for each state (facet by state)
ggplot(data = fbiwide, aes(x = year, y = motor_vehicle_theft)) + geom_point() + facet_wrap(~state)

  1. The numbers are dominated by the number of thefts in California, New York, and Texas. Use a log-scale for the y-axis.
ggplot(data = fbiwide, aes(x = year, y = log(motor_vehicle_theft))) + geom_point() + facet_wrap(~state)

  1. Give each panel its own scale
ggplot(data = fbiwide, aes(x = year, y = motor_vehicle_theft / population)) + geom_point() + facet_wrap(~state, scale = "free_y")

  1. Using ggplot2, draw side-by-side boxplots of the number of robberies by state. Use a log transformation on y and compare results
# use fbiwide data
ggplot(data = fbiwide, aes(x = state, y = log(robbery / population))) + geom_boxplot() + coord_flip()

# use fbi data
ggplot(data = fbi[fbi$type == "robbery", ], aes(x = state, count / population)) + geom_boxplot() + coord_flip()

  1. How to make boxplots comparing across states for different types of crimes?
ggplot(data = fbi[fbi$type %in% c("homicide", "arson", "rape_legacy"), ], aes(x = state, y = count / population)) + geom_boxplot() + coord_flip() + 
  facet_wrap(~type)
## Warning: Removed 212 rows containing non-finite values (`stat_boxplot()`).

Boxplot of count vs type of crimes, facet by states

ggplot(data = fbi, aes(x = type, y = count / population)) + geom_boxplot() + facet_wrap(~state, scales = "free_y")
## Warning: Removed 1960 rows containing non-finite values (`stat_boxplot()`).

Histogram

specify number of bins:

ggplot(fbiwide, aes(x = motor_vehicle_theft)) + 
  geom_histogram(bins = 100) + ggtitle("number of bins = 100")

specify the binwidth:

ggplot(fbiwide, aes(x = motor_vehicle_theft)) + 
  geom_histogram(binwidth = 5000) + ggtitle("binwidth = 5000")

density plot by histogram:

ggplot(fbiwide, aes(x = motor_vehicle_theft)) + 
  geom_histogram(aes(y = ..density..))
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. A barchart of the variable Violent Crime. Make the height of the bars dependent on the number of reports (use weight). Then facet by type (does the result match your expectation? good! get rid of facetting). Color bars by type.
ggplot(data = fbi, aes(x = violent_crime)) + 
  geom_bar(aes(weight = count))